Grammar Induction by Distributional Clustering with the Fragment Constituency Criterion
Abstract
This paper proposes that the identification of constituents, the core problem in grammar induction, can be accomplished with a simple constituency criterion from linguistics: a word/tag sequence that can occur as a fragment is a constituent. Experimental results show that grammar induction by distributional clustering, augmented with this criterion, achieves good PARSEVAL scores and improves phrase-based statistical machine translation.
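The fragment criterion stated above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how fragments are identified or how the criterion is combined with distributional clustering. The sketch assumes that a list of fragment tag sequences is already available, and the names `collect_fragments` and `candidate_constituents` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Tags = Tuple[str, ...]


def collect_fragments(fragments: Iterable[List[str]]) -> Set[Tags]:
    """Store each observed fragment as a hashable tag tuple.

    Assumption: `fragments` are tag sequences already known to occur as
    standalone fragments; how they are detected is not covered here.
    """
    return {tuple(f) for f in fragments if f}


def candidate_constituents(
    sentence: List[str], fragment_set: Set[Tags]
) -> List[Tuple[int, int]]:
    """Return half-open spans (i, j) whose tag sequence occurs as a fragment."""
    spans = []
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if tuple(sentence[i:j]) in fragment_set:
                spans.append((i, j))
    return spans


# Toy data (made up for illustration, not taken from the paper).
fragments = [["DT", "NN"], ["IN", "DT", "NN"]]
sentence = ["DT", "NN", "VBD", "IN", "DT", "NN"]
spans = candidate_constituents(sentence, collect_fragments(fragments))
print(spans)
```

In this toy example the spans covering "DT NN" and "IN DT NN" are proposed as constituents because those tag sequences also appear as fragments. A full system would still need to score and select among such candidates.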
Similar resources
Unsupervised induction of stochastic context-free grammars using distributional clustering
An algorithm is presented for learning a phrase-structure grammar from tagged text. It clusters sequences of tags together based on local distributional information, and selects clusters that satisfy a novel mutual information criterion. This criterion is shown to be related to the entropy of a random variable associated with the tree structures, and it is demonstrated that it selects linguisti...
Distributional phrase structure induction
Unsupervised grammar induction systems commonly judge potential constituents on the basis of their effects on the likelihood of the data. Linguistic justifications of constituency, on the other hand, rely on notions such as substitutability and varying external contexts. We describe two systems for distributional grammar induction which operate on such principles, using part-of-speech tags as t...
Joint Learning of Constituency and Dependency Grammars by Decomposed Cross-Lingual Induction
Cross-lingual induction aims to acquire linguistic structures for one language by resorting to annotations from another language. It works well for simple structured prediction problems such as part-of-speech tagging and dependency parsing, but has made little progress on more complicated problems such as constituency parsing and deep semantic parsing, mainly due to the structural non-...
Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency
We present a generative model for the unsupervised learning of dependency structures. We also describe the multiplicative combination of this dependency model with a model of linear constituency. The product model outperforms both components on their respective evaluation metrics, giving the best published figures for unsupervised dependency parsing and unsupervised constituency parsing. We als...
Unsupervised Grammar Induction by Distribution and Attachment
Distributional approaches to grammar induction are typically inefficient, enumerating large numbers of candidate constituents. In this paper, we describe a simplified model of distributional analysis which uses heuristics to reduce the number of candidate constituents under consideration. We apply this model to a large corpus of over 400,000 words of written English, and evaluate the results usi...